This tutorial will walk you through the entire data science pipeline, starting from data collection and processing, then moving on to exploratory data analysis and data visualization. Next, we will use hypothesis testing and machine learning to analyze the data. Lastly, we will summarize the insights learned throughout the tutorial. This tutorial focuses primarily on data processing and on analysis through visualizations created with the Pyplot and Plotly libraries.

The data set we will analyze is the Homicide Reports (1980-2014) from the FBI and FOIA, which can be downloaded here. We chose this data because it contains many variables, allowing a variety of analyses from different angles. In addition, by analyzing the homicide reports and looking at the number of cases, we can hopefully find trends and become more aware of how serious the problem can be.
I imported the data from the csv file and replaced any unknown values with np.nan, as seen in the code below.
import pandas as pd
import numpy as np
data = pd.read_csv("database.csv", low_memory=False)
df = pd.DataFrame(data)
#replace unknown data to np.nan
df.replace('Unknown', np.nan, inplace=True)
df.head()
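As a quick sanity check that replace behaves as expected, here is a minimal sketch on a toy frame (hypothetical values, not the real dataset):

```python
import pandas as pd
import numpy as np

# toy frame with one 'Unknown' entry, standing in for the real data
toy = pd.DataFrame({'Weapon': ['Gun', 'Unknown', 'Knife']})
toy.replace('Unknown', np.nan, inplace=True)

# the 'Unknown' string is now a proper missing value
print(toy['Weapon'].isna().sum())  # 1
```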
The code below generates the graph of Number of Homicide by Year. In this plot, we are interested in the number of cases each year from 1980 to 2014. To count the cases, instead of directly counting the Incident column, I used the size() method of groupby to count the number of rows in each group, which I thought might be slightly more accurate since two incidents can involve the same victim and perpetrator.
import matplotlib.pyplot as plt
plt.style.use('ggplot')
g = df.groupby('Year')
years = sorted(g.groups.keys())
size = g.size().values.ravel()
fig, ax = plt.subplots()
ax.plot(years, size, marker='.', linestyle='-', ms=5, color = 'purple', alpha = .5)
start, end = ax.get_xlim()
ax.xaxis.set_ticks(np.arange(start, end, 1))
ax.set_xlabel('Year')
ax.set_ylabel('Number of Homicide')
ax.set_title("1980-2014 Number of Homicide by Year")
plt.xticks(rotation=90)
plt.show()
plt.close("all")
From the plot above, we can immediately see a huge decline in the number of homicides in the late 1990s. Unfortunately, after some research there still seems to be no definite answer for the cause of the decline. However, there are articles that discuss some hypotheses for it. The links are provided below:
To better understand the plot, we can use sklearn to fit a linear regression model to the data in the graph above.
from sklearn import linear_model
regr = linear_model.LinearRegression()
#fitting the regression model
x = years
y = size
x = np.reshape(x,(-1,1))
y = y.reshape(-1,1)
regr.fit(x, y)
fig, ax = plt.subplots()
ax.plot(years, size, marker='.', linestyle='None', ms=5, color = 'purple', alpha = .5)
ax.plot(years, regr.predict(x).ravel(), color='blue', alpha= .5)
start, end = ax.get_xlim()
ax.xaxis.set_ticks(np.arange(start, end, 1))
ax.set_xlabel('Year')
ax.set_ylabel('Number of Homicide')
ax.set_title("1980-2014 Number of Homicide by Year")
plt.xticks(rotation=90)
plt.show()
plt.close("all")
From the linear regression line that sklearn generated, we can see that even though there are a few outliers around 1992-1994 and 1999-2000, the line has a negative slope. This means the overall trend in the number of homicides is decreasing as the years pass. This leads us to the next question: is there a relationship between the victim and the perpetrator?
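Beyond eyeballing the line, the fitted slope itself quantifies the trend. A minimal sketch with made-up yearly counts (not the real data) shows how to read it off the model:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# hypothetical counts that fall by exactly 150 cases per year
years = np.arange(1980, 2015).reshape(-1, 1)
counts = (22000 - 150 * (years.ravel() - 1980)).reshape(-1, 1)

regr = LinearRegression().fit(years, counts)
# a negative coefficient confirms a downward trend
print(regr.coef_[0][0])  # -150.0 (approximately)
```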
For the next plot, we would like to see the relationship between the victim and the perpetrator. The "Other" category includes some of the closer relationships, which are listed below:
Neighbor
Boyfriend/Girlfriend
Friend
Family
Common-Law Husband
Common-Law Wife
Stepdaughter
Stepfather
Stepmother
Stepson
Ex-Husband
Ex-Wife
Employee
Employer
We will group the data by year and by the relationship between the victim and the perpetrator, separating it into three categories: stranger, acquaintance, and other. We then count the total number for each category, as seen in the code below.
g1 = df.groupby(['Year','Relationship'])
g1 = g1.size()
#Create a new dataframe that has the year as index and three columns indicating the number of 'Stranger', 'Acquaintance', and 'Other'
df2 = pd.DataFrame(index = years, columns = ['Stranger', 'Acquaintance', 'Other'])
stranger = []
acq= []
other = []
y = 1980
s = 0
#counting the total number of each relationship category for each year
for index, series in g1.iteritems():
    if index[1] == 'Stranger':
        stranger.append(series)
    elif index[1] == 'Acquaintance':
        acq.append(series)
    else:
        if y == index[0] - 1:
            #moved on to a new year: store the previous year's 'Other' total
            other.append(s)
            s = 0
            y += 1
        s += series
#store the final year's total after the loop ends
other.append(s)
df2['Stranger'] = stranger
df2['Acquaintance'] = acq
df2['Other'] = other
f, ax1 = plt.subplots(1, figsize=(20,6))
ax1.set_xlabel('Year')
ax1.set_ylabel('Number of Homicide')
ax1.set_title("Relationship Between Victim and Perpetrator")
df2.plot.bar(stacked=True,ax=ax1, alpha = .5, width = .8, color =['#F4561D','#F1911E','#F1BD1A'])
plt.show()
From the bar graph above, we can see a decrease in the number of homicides across all three relationship categories. However, while the counts start off fairly close for the closer relationships ('Other') and acquaintances, the decreasing trend for acquaintances is more obvious than for 'Other'. We can also see that the number of homicides committed by strangers did not seem to decrease as much as the other two; it even becomes very close to the acquaintance count starting around 2000.
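As a cross-check of the manual tallying above, the same three-column table can be built with groupby and unstack; a sketch on a toy frame (hypothetical rows, assuming the same 'Year' and 'Relationship' column names):

```python
import pandas as pd

# toy rows standing in for the homicide records
toy = pd.DataFrame({
    'Year': [1980, 1980, 1980, 1981, 1981],
    'Relationship': ['Stranger', 'Acquaintance', 'Friend', 'Stranger', 'Family'],
})

# one column per relationship, zero where a category is absent in a year
counts = toy.groupby(['Year', 'Relationship']).size().unstack(fill_value=0)
other_cols = [c for c in counts.columns if c not in ('Stranger', 'Acquaintance')]
summary = pd.DataFrame({
    'Stranger': counts['Stranger'],
    'Acquaintance': counts['Acquaintance'],
    'Other': counts[other_cols].sum(axis=1),
})
print(summary)
```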
Now that we have seen the number of homicides by the relationship between victim and perpetrator, we might also want to know their sexes, so we have included graphs of the number of homicide victims and perpetrators by sex.
g2 = df.groupby(['Year','Victim Sex'])
#reshape panda series to one column of # of Female Victim and one column of # Male Victim
g2 = g2.size().values.reshape(35,2)
df3 = pd.DataFrame(index = years, columns =['#Female Victim','#Male Victim'], data=g2)
f, ax2 = plt.subplots(1, figsize=(20,6))
ax2.set_xlabel('Year')
ax2.set_ylabel('Number of Victim')
ax2.set_title("Sex of Homicide Victim")
df3.plot.bar(ax=ax2, color=['r','b'],alpha=0.5, width=0.8)
g3 = df.groupby(['Year','Perpetrator Sex'])
g3 = g3.size().values.reshape(35,2)
df4 = pd.DataFrame(index = years, columns =['#Female Perpetrator','#Male Perpetrator'], data=g3)
f, ax3 = plt.subplots(1, figsize=(20,6))
ax3.set_xlabel('Year')
ax3.set_ylabel('Number of Perpetrator')
ax3.set_title("Sex of Homicide Perpetrator")
df4.plot.bar(ax=ax3, color=['r','b'],alpha=0.5, width=0.8)
plt.show()
It might not be very surprising that the male counts are much higher than the female counts, but it is interesting that the trends for victims and perpetrators look almost identical. From the resulting graphs, we also notice that the perpetrator counts are lower than the victim counts, because perpetrator information is missing for unsolved cases. So, for the next plot we will show the percentage of homicides solved each year.
To get the percentage of crimes solved each year, we grouped the data by year and by whether the crime was solved. The data are separated by whether or not the case has been solved.
g4 = df.groupby(['Year','Crime Solved'])
#reshape the pandas series into one column of #Not Solved and one column of #Solved
g4 = g4.size().values.reshape(35,2)
df5 = pd.DataFrame(index = years, columns =['#Not Solved','#Solved'], data=g4)
#calculate the crime solve percentage
df5['Crime Solved %'] = (df5['#Solved']/(df5['#Solved']+df5['#Not Solved'])*100)
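On hypothetical counts (not the real data), the percentage formula above behaves as expected:

```python
import pandas as pd

# toy solved/unsolved counts for two years
toy = pd.DataFrame({'#Not Solved': [5, 10], '#Solved': [15, 10]}, index=[1980, 1981])
toy['Crime Solved %'] = toy['#Solved'] / (toy['#Solved'] + toy['#Not Solved']) * 100
print(toy['Crime Solved %'].tolist())  # [75.0, 50.0]
```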
After that, we plot the data with a linear regression line to generate the graph below.
x = years
y = df5['Crime Solved %']
x = np.reshape(x, (-1,1))
y = y.values.reshape(-1,1)
regr = linear_model.LinearRegression()
#fitting the regression model
regr.fit(x, y)
fig, ax5 = plt.subplots()
ax5.plot(df5.index, df5['Crime Solved %'], marker='.', linestyle='None', ms=5, color = 'orange')
start, end = ax5.get_xlim()
ax5.xaxis.set_ticks(np.arange(start, end, 1))
ax5.set_xlabel('Year')
ax5.set_ylabel('Percentage of Homicide Solved')
ax5.set_title("1980-2014 Percentage of Homicide Solved by Year")
# plot the regression line
ax5.plot(x.ravel(), regr.predict(x).ravel(), color='blue', alpha= .5)
plt.xticks(rotation=90)
plt.show()
The result was a little unexpected for me. At first, I thought we would see a clear upward trend in the percentage of solved homicide cases, thanks to improving technology and lessons learned from previous cases. However, the regression line shows that the percentage of solved cases is actually decreasing over time. One possible reason is that perpetrators also have access to knowledge and technology that make cases harder to solve.
The choropleth map is another powerful technique that provides strong visualization. Below we will use a choropleth map to show the total number of homicide cases from 1980 to 2014.
g5 = df.groupby('State')
g5 = g5.size()
# since we cannot show DC separately on the 50-state map, we decided to add DC's count to Maryland, which borders DC
maryland_total = g5.get('District of Columbia') + g5.get('Maryland')
g5.loc['Maryland'] = maryland_total
# we need to translate the state names into state codes for the plotly map to process the data
code = ['AL','AK','AZ','AR','CA','CO','CT','DE','DC','FL',
        'GA','HI','ID','IL','IN','IA','KS','KY','LA','ME','MD','MA','MI',
        'MN','MS','MO','MT','NE','NV','NH','NJ','NM','NY','NC','ND','OH',
        'OK','OR','PA','RI','SC','SD','TN','TX','UT','VT','VA','WA','WV','WI','WY']
g5.index = code
After grouping the data by state, we can generate the map below using plotly:
import plotly
# initialize plotly in offline mode so you do not need a plotly account
plotly.offline.init_notebook_mode()
# we created a purple color scale to use
scl = [[0.0, 'rgb(242,240,247)'],[0.2, 'rgb(218,218,235)'],[0.4, 'rgb(188,189,220)'],
       [0.6, 'rgb(158,154,200)'],[0.8, 'rgb(117,107,177)'],[1.0, 'rgb(84,39,143)']]
data = [dict(
    type='choropleth',
    colorscale=scl,
    autocolorscale=False,
    locations=g5.index,
    z=g5.values,
    locationmode='USA-states',
    marker=dict(
        line=dict(
            color='rgb(255,255,255)',
            width=2
        )),
    colorbar=dict(
        title="Number of cases")
)]
layout = dict(
    title='1980 - 2014 Number of Homicide by State',
    geo=dict(
        scope='usa',
        projection=dict(type='albers usa'),
        showlakes=True,
        lakecolor='rgb(255, 255, 255)'),
)
fig = dict(data=data, layout=layout)
plotly.offline.iplot(fig)
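Note that assigning the code list relies on g5 being ordered alphabetically by state name. A less fragile alternative is an explicit name-to-code dictionary; a sketch with just a few states (a full mapping would cover all 51 entries):

```python
# partial name-to-code mapping (extend to all 51 entries in practice)
state_codes = {
    'Alabama': 'AL',
    'Alaska': 'AK',
    'District of Columbia': 'DC',
    'Maryland': 'MD',
}

# look up codes in whatever order the states appear
states = ['Maryland', 'Alabama', 'District of Columbia']
codes = [state_codes[s] for s in states]
print(codes)  # ['MD', 'AL', 'DC']
```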
From the result, we can hypothesize that states with higher populations also have higher numbers of cases. To test this hypothesis, we also gathered state population data from the Census. The state_population data was collected by manually looking up each state's population for every year from 1980 to 2014 and entering it into an Excel document.
popdf = pd.read_excel("state_population.xlsx")
#adding an extra column for the number of cases, which can be used later
popdf['Cases'] = g5.values
#as with the previous data, we add DC's numbers into MD and take the average population across the years
m = popdf.loc[popdf['State']=='MD']
d = popdf.loc[popdf['State']=='DC']
t = m.values+d.values
t = np.delete(t,0)
t = np.delete(t,35)
a = np.mean(t)
popdf.loc[20, 'Average'] = a
popdf.head()
plotly.offline.init_notebook_mode()
scl = [[0.0, 'rgb(242,240,247)'],[0.2, 'rgb(218,218,235)'],[0.4, 'rgb(188,189,220)'],
       [0.6, 'rgb(158,154,200)'],[0.8, 'rgb(117,107,177)'],[1.0, 'rgb(84,39,143)']]
data = [dict(
    type='choropleth',
    colorscale=scl,
    autocolorscale=False,
    locations=popdf['State'],
    z=popdf['Average'],
    locationmode='USA-states',
    marker=dict(
        line=dict(
            color='rgb(255,255,255)',
            width=2
        )),
    colorbar=dict(
        title="Population")
)]
layout = dict(
    title='1980 - 2014 Average Population by State',
    geo=dict(
        scope='usa',
        projection=dict(type='albers usa'),
        showlakes=True,
        lakecolor='rgb(255, 255, 255)'),
)
fig = dict(data=data, layout=layout)
plotly.offline.iplot(fig)
The result from the average population data agrees with our hypothesis: the map looks almost identical to the previous choropleth map.
To get a better view of the relationship between the number of homicides by state and the average population by state, we can again fit a linear model and draw the regression line between them.
fig, ax6 = plt.subplots()
ax6.plot(popdf.Cases,popdf.Average, linestyle='None', marker='.' )
x = popdf.Cases
y = popdf.Average
x = x.values.reshape(-1,1)
y = y.values.reshape(-1,1)
regr = linear_model.LinearRegression()
regr.fit(x, y)
ax6.plot(x.ravel(), regr.predict(x).ravel(), color='blue', alpha= .5)
#remove the auto offset and scientific notation for large numbers
ax6.ticklabel_format(useOffset=False, style='plain')
ax6.set_xlabel('Number of Homicide')
ax6.set_ylabel('Average State Population')
ax6.set_title("1980-2014 Total Number of Homicide vs. States Average Population")
plt.show()
The result clearly shows a positive relationship between the number of homicides and the population: as the population increases, the number of homicide cases also increases.
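One way to put a number on this positive relationship is the Pearson correlation coefficient; a sketch with hypothetical (cases, population) pairs rather than the real state data:

```python
import numpy as np

# hypothetical total cases and average populations for four states
cases = np.array([100, 500, 2500, 9000])
population = np.array([600_000, 3_000_000, 12_000_000, 38_000_000])

# a coefficient close to 1 indicates a strong positive relationship
r = np.corrcoef(cases, population)[0, 1]
print(r > 0.99)  # True
```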
The motion bubble chart is another visualization tool that can help us see the relationship between different variables. Our motion bubble chart will show how each state's population and number of homicides changed each year.
To prepare our data for the motion bubble chart, we need to rearrange it to get the number of cases per year for each state. However, we found that the data for some states is missing for some years. We looked back at the original data source, but it does not mention the missing data, so we were not able to find out why it is missing.
import queue
dfState = df.groupby(['State','Year']).size()
#find and fill missing data using queue to check the years range for each state
years = queue.Queue()
#range is inclusive for the start values and exclusive for the end value
for j in range(1980,2015):
years.put(j)
#iterate over the rows and, for each state, find the years missing a value and insert np.nan
for i, row in dfState.iteritems():
    if years.empty():
        for j in range(1980, 2015):
            years.put(j)
    y = years.get()
    if type(i) != int:
        if i[1] != y:
            for x in range(y, i[1]):
                dfState.loc[(i[0], x)] = np.nan
                y = years.get()
#convert the pandas series to a dataframe
dataState = dfState.to_frame('Crime')
#making year and state columns
dataState = dataState.reset_index()
#sort dataframe first by year then by state
dataState.sort_values(by=['Year', 'State'], inplace=True)
#now we want to add the population data to our dataframe
#drop the unused columns
temp_pop = popdf.drop('Average',1)
temp_pop.drop('Cases', 1,inplace=True)
temp_pop.drop('State', 1,inplace=True)
temp_pop = temp_pop.transpose()
pop = temp_pop.as_matrix()
dataState['Population'] = pop.reshape(1785,1)
dataState['pop'] = dataState['Population']
#rearrange to use it in the bubble chart
dataState = dataState[['Year','pop','Crime','Population','State']]
To handle the missing data, we decided to treat the missing entries as zero, so the number of crime cases will be 0 wherever the data is missing.
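The queue-based gap filling above can also be expressed with pandas' reindex over the full (state, year) grid; a sketch on a toy series (assuming the same shape as df.groupby(['State','Year']).size()):

```python
import pandas as pd

# toy series standing in for the grouped counts; Alabama is missing 1981
s = pd.Series(
    [5, 3, 7],
    index=pd.MultiIndex.from_tuples(
        [('Alabama', 1980), ('Alabama', 1982), ('Alaska', 1980)],
        names=['State', 'Year'],
    ),
)

# build the complete (state, year) grid and fill missing combinations with 0
full = pd.MultiIndex.from_product(
    [s.index.get_level_values('State').unique(), range(1980, 1983)],
    names=['State', 'Year'],
)
filled = s.reindex(full, fill_value=0)
print(filled.loc[('Alabama', 1981)])  # 0
```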
Finally, we can import motionchart and pass the dataframe to the motion chart. The bubble size is determined by the state's population. The y-axis of the motion chart is the population, the x-axis is the number of crimes, and the chart animates over the years.
You can click on a bubble to show the state's name, and you can hover over a bubble to see more information.
from motionchart.motionchart import MotionChart, MotionChartDemo
mChart = MotionChart(df=dataState, title = "Crime Cases by States")
mChart.to_notebook()
The motion bubble chart allows us to see some of the outliers, which include California, Texas, New York, Florida, and Illinois. It is very interesting to see that Texas's population caught up with and eventually surpassed New York's. However, Texas had more crime cases than New York even while its population was still smaller. Furthermore, we can also see that as Florida caught up with New York's population, it too had more crime cases than New York. It seems that fast-growing states may have more crime cases than other states.
This tutorial only highlighted some of the basic elements of data analysis in Python. There are many other ways of handling and analyzing data, especially when it comes to handling missing data and machine learning. More details and tools are available from the following links.
Data:
Visualization Tool:
Others: